1 Introduction

The intent of this file is to sintetize EDA approaches learned during Udacity’s Nanodegree in R language. For the analysis, a database from a Chicago bike sharing company will be used.

2 Preparação

First, let’s download all libraries that will be used, as well as the bike sharing database.

2.2 Database Import

## [1] "data.frame"

3 EDA

3.1 Summary

Let’s check how many rows and columns our dataset have, as well as check the summary of all the variables:

## [1] 1551505       8
##                Start.Time                     End.Time      
##  2017-04-15 13:27:10:      7   2017-05-21 13:07:49:      8  
##  2017-05-29 15:19:36:      7   2017-06-10 13:31:26:      8  
##  2017-04-15 13:02:35:      6   2017-02-18 14:24:00:      7  
##  2017-04-26 16:07:17:      6   2017-06-02 17:16:46:      7  
##  2017-05-15 17:14:43:      6   2017-06-03 14:32:22:      7  
##  2017-05-16 17:56:13:      6   2017-06-10 18:47:17:      7  
##  (Other)            :1551467   (Other)            :1551461  
##  Trip.Duration                          Start.Station    
##  Min.   :   60.0   Streeter Dr & Grand Ave     :  36686  
##  1st Qu.:  392.0   Clinton St & Washington Blvd:  22429  
##  Median :  670.0   Lake Shore Dr & Monroe St   :  21822  
##  Mean   :  939.8   Clinton St & Madison St     :  19098  
##  3rd Qu.: 1127.0   Canal St & Adams St         :  18105  
##  Max.   :86338.0   Lake Shore Dr & North Blvd  :  17729  
##                    (Other)                     :1415636  
##                        End.Station           User.Type      
##  Streeter Dr & Grand Ave     :  39537   Customer  : 317162  
##  Clinton St & Washington Blvd:  21113   Dependent :      4  
##  Clinton St & Madison St     :  20969   Subscriber:1234339  
##  Lake Shore Dr & Monroe St   :  20652                       
##  Lake Shore Dr & North Blvd  :  19734                       
##  Theater on the Lake         :  18529                       
##  (Other)                     :1410971                       
##     Gender         Birth.Year    
##        :316867   Min.   :1899    
##  Female:298784   1st Qu.:1975    
##  Male  :935854   Median :1984    
##                  Mean   :1981    
##                  3rd Qu.:1989    
##                  Max.   :2016    
##                  NA's   :316683

Our dataframe has around 1,5M rows and 8 columns. The variables goes as following:

  • Start.Time = timestamp when the user starts a ride
  • End.Time = timestamp when the user finishes a ride
  • Trip.Duration = for how long the user rides the rented bike (in seconds)
  • Start.Station = station where the user got the bike
  • End.Station = station where the user left the bike
  • User.Type = can be customer, dependent or subscriber
  • Gender
  • Birth.Year

Let’s take a look at the first 20 rows of the dataset:

##             Start.Time            End.Time Trip.Duration
## 1  2017-01-01 00:00:36 2017-01-01 00:06:32           356
## 2  2017-01-01 00:02:54 2017-01-01 00:08:21           327
## 3  2017-01-01 00:06:06 2017-01-01 00:18:31           745
## 4  2017-01-01 00:07:28 2017-01-01 00:12:51           323
## 5  2017-01-01 00:07:57 2017-01-01 00:20:53           776
## 6  2017-01-01 00:10:44 2017-01-01 00:21:27           643
## 7  2017-01-01 00:11:34 2017-01-01 00:23:47           733
## 8  2017-01-01 00:14:57 2017-01-01 00:26:22           685
## 9  2017-01-01 00:15:03 2017-01-01 00:26:28           685
## 10 2017-01-01 00:17:01 2017-01-01 00:29:49           768
## 11 2017-01-01 00:17:13 2017-01-01 11:03:34         38781
## 12 2017-01-01 00:18:28 2017-01-01 00:31:05           757
## 13 2017-01-01 00:18:50 2017-01-01 00:21:47           177
## 14 2017-01-01 00:23:41 2017-01-01 00:29:13           332
## 15 2017-01-01 00:25:47 2017-01-01 00:39:53           846
## 16 2017-01-01 00:25:47 2017-01-01 00:43:23          1056
## 17 2017-01-01 00:26:21 2017-01-01 00:39:40           799
## 18 2017-01-01 00:27:21 2017-01-01 00:42:59           938
## 19 2017-01-01 00:27:28 2017-01-01 00:42:44           916
## 20 2017-01-01 00:27:45 2017-01-01 00:31:13           208
##                           Start.Station                   End.Station
## 1                  Canal St & Taylor St      Canal St & Monroe St (*)
## 2            Larrabee St & Menomonee St  Sheffield Ave & Kingsbury St
## 3  Orleans St & Chestnut St (NEXT Apts)    Ashland Ave & Blackhawk St
## 4               Franklin St & Monroe St        Clinton St & Tilden St
## 5                  Broadway & Barry Ave       Sedgwick St & North Ave
## 6                  State St & Kinzie St            Wells St & Polk St
## 7                Wabash Ave & Wacker Pl        Clinton St & Tilden St
## 8                    Daley Center Plaza      Canal St & Monroe St (*)
## 9                    Daley Center Plaza      Canal St & Monroe St (*)
## 10                Dayton St & North Ave       Ogden Ave & Chicago Ave
## 11           Wilton Ave & Diversey Pkwy   Halsted St & Wrightwood Ave
## 12                Canal St & Madison St      LaSalle St & Illinois St
## 13                  Theater on the Lake Lakeview Ave & Fullerton Pkwy
## 14              Halsted St & Maxwell St          Halsted St & 18th St
## 15        Ravenswood Ave & Lawrence Ave    Clarendon Ave & Gordon Ter
## 16             Clark St & Congress Pkwy         Wolcott Ave & Polk St
## 17        Ravenswood Ave & Lawrence Ave    Clarendon Ave & Gordon Ter
## 18                      Millennium Park        Michigan Ave & 18th St
## 19                      Millennium Park        Michigan Ave & 18th St
## 20              Damen Ave & Chicago Ave       Damen Ave & Division St
##     User.Type Gender Birth.Year
## 1    Customer                NA
## 2  Subscriber   Male       1984
## 3  Subscriber   Male       1985
## 4  Subscriber   Male       1990
## 5  Subscriber   Male       1990
## 6  Subscriber   Male       1970
## 7  Subscriber   Male       1986
## 8    Customer                NA
## 9    Customer                NA
## 10   Customer                NA
## 11 Subscriber Female       1988
## 12   Customer                NA
## 13 Subscriber   Male       1991
## 14 Subscriber   Male       1984
## 15 Subscriber Female       1987
## 16 Subscriber   Male       1984
## 17 Subscriber   Male       1987
## 18 Subscriber   Male       1991
## 19 Subscriber Female       1990
## 20 Subscriber   Male       1982

3.4 Gender

Let’s check how many male riders we have:

## # A tibble: 1 x 1
##        n
##    <int>
## 1 935854

What about female riders?

## # A tibble: 1 x 1
##        n
##    <int>
## 1 298784

Let’s check how each Gender compare considering their average trip duration:

## df$Gender: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      60     927    1379    1875    1962   86338 
## -------------------------------------------------------- 
## df$Gender: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    60.0   398.8   648.0   783.7  1010.0 85782.0 
## -------------------------------------------------------- 
## df$Gender: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    60.0   338.0   541.0   673.1   861.0 86096.0

Although there’s significantly more male riders than female ones, the female have longer trip durations on average. We can check how they compare using a frequency polygon:

We can see that the female normal distribution is slightly more left-skewed than the male curve, validating what we saw earlier.

3.5 Birth year

## Warning: Removed 316683 rows containing non-finite values (stat_bin).

Most users birth year falls between 1950 and 2000, with a peak around 1980-1990.

Distribution of types of user:

We can see that most of the riders are actual subscribers.

Using the plotly package, we can creat the trip duration per birth year of users, also showing the start and end station. Since the chart is heavy for the amout of data, we will use a 10k sample of the data.

## Warning: Ignoring 2053 observations

We can see that the majority of trips stay under the 10k seconds threshold and user age doesn’t really seem to have an impact on it. The ride with most duration departed from Clinton st and lasted for more than 74K seconds (or more than 20 hours!). It’s probably a user who forgot to return the bike =D

Since the Female riders seems to have longer trips on average on age hasn’t prove to have much different on trip duration, let’s try grouping our data by age and gender and check the average trip duration for the group:

For the birthday range where we have most of the datapoints (between 1950 and 2000), we can see that Female riders consistently have a higher trip duration than Male riders.

Let’s check the same analysis, but for customers x subscribers:

## Warning: Removed 1 rows containing non-finite values (stat_summary).

Customers have longer trip than subscribers. Could that be a consequence of more female riders as customers than subscribers?

3.6 Stations

Finally, let’s check which are the 5 most popular stations on both starting and finishing rides:

Interesting! The 2 most popular start stations are also the 2 most popular end stations. Also, both Clinton St & Madison St and Lake Shore Dr & Monroe St appear in the top 5, although in different positions.

Let’s check how many (if any) trips start and finish in the same station by adding a new column to our dataset:

## [1] 0

There are not trips starting and ending at the same station.

4 Final Plots

I’ve chosen the charts that compare trip duration vs age for both gender and user type since they are the most insigthful charts as described later in the reflection. Also the trip duration distribution using plotly shows how concentrated the trips are and how age have a low influence in this specific variable.

## Warning: Ignoring 2053 observations

5 Reflection

Based on the analysis of the bike sharing database, we can evaluate a number of strategic information such as most engaged users’ profile and most popular stations.

First, regarding which day of the week has the most number of bike rides, it is obviously the weekend.

We can see that the customer base of this bike sharing company is unbalanced in terms of gender: a lot more Male riders than Female. However, Female riders have a longer trip duration on average, which probably also translates in higher NPV. Also, there’s no big difference in trip duration when looking at user age.

Regarding the stations, we can see that 2 stations are the most popular for both starting and ending trips: Streeter Dr & Grand Ave and Clinton St & Washington Blvd. They must be located in very important spots of Chicago.